Dataset statistics
| Number of variables | 11 |
|---|---|
| Number of observations | 90584 |
| Missing cells | 48616 |
| Missing cells (%) | 4.9% |
| Duplicate rows | 0 |
| Duplicate rows (%) | 0.0% |
| Total size in memory | 7.6 MiB |
| Average record size in memory | 88.0 B |
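As a cross-check, the headline numbers in the table above can be recomputed directly with pandas. The sketch below does so for a small made-up DataFrame (`demo` is an illustrative stand-in, not the profiled dataset):

```python
import numpy as np
import pandas as pd

def dataset_statistics(df: pd.DataFrame) -> dict:
    """Recompute the headline numbers of the table above for any DataFrame."""
    n_cells = df.size                          # rows * columns
    n_missing = int(df.isna().sum().sum())
    return {
        "Number of variables": df.shape[1],
        "Number of observations": df.shape[0],
        "Missing cells": n_missing,
        "Missing cells (%)": round(100 * n_missing / n_cells, 1),
        "Duplicate rows": int(df.duplicated().sum()),
        "Total size in memory (B)": int(df.memory_usage(deep=True).sum()),
    }

# Tiny made-up frame: one NaN cell, one duplicated row.
demo = pd.DataFrame({"a": [1.0, 2.0, 2.0, np.nan], "b": ["x", "y", "y", "z"]})
print(dataset_statistics(demo))
```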
Variable types
| NUM | 10 |
|---|---|
| CAT | 1 |
Reproduction
| Analysis started | 2020-08-27 16:16:58.948575 |
|---|---|
| Analysis finished | 2020-08-27 16:17:34.019913 |
| Duration | 35.07 seconds |
| Version | pandas-profiling v2.8.0 |
| Command line | pandas_profiling --config_file config.yaml [YOUR_FILE.csv] |
| Download configuration | config.yaml |
Warnings
| Warning | Type |
|---|---|
| Body has a high cardinality: 90350 distinct values | High cardinality |
| user_id is highly correlated with df_index | High correlation |
| df_index is highly correlated with user_id | High correlation |
| Views is highly correlated with Reputation | High correlation |
| Reputation is highly correlated with Views | High correlation |
| ViewCount has 48396 (53.4%) missing values | Missing |
| ViewCount is highly skewed (γ1 = 24.95697592) | Skewed |
| Body is uniformly distributed | Uniform |
| df_index has unique values | Unique |
| post_id has unique values | Unique |
| Views has 7040 (7.8%) zeros | Zeros |
| UpVotes has 22186 (24.5%) zeros | Zeros |
| DownVotes has 50205 (55.4%) zeros | Zeros |
| Score has 19927 (22.0%) zeros | Zeros |
| CommentCount has 38051 (42.0%) zeros | Zeros |
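The per-column warnings above follow simple rules: any missing values, any zeros, skewness beyond a threshold. A minimal sketch of how such warnings could be recomputed with pandas — the 20.0 skewness threshold and the exact wording are assumptions for illustration, not taken from pandas-profiling's source:

```python
import pandas as pd

def column_warnings(s: pd.Series, skew_threshold: float = 20.0) -> list:
    """Recreate a few of the report's per-column warnings (thresholds are assumed)."""
    n = len(s)
    out = []
    n_missing = int(s.isna().sum())
    if n_missing:
        out.append(f"{s.name} has {n_missing} ({100 * n_missing / n:.1f}%) missing values")
    n_zeros = int((s == 0).sum())     # NaN == 0 is False, so missing cells are not counted
    if n_zeros:
        out.append(f"{s.name} has {n_zeros} ({100 * n_zeros / n:.1f}%) zeros")
    if abs(s.skew()) > skew_threshold:
        out.append(f"{s.name} is highly skewed (γ1 = {s.skew():.8g})")
    return out

# Illustrative series, not the actual Views column.
views = pd.Series([0, 0, 5, 45, 514, 20932], name="Views")
print(column_warnings(views))
```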
df_index
| Distinct count | 90584 |
|---|---|
| Unique (%) | 100.0% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Infinite | 0 |
| Infinite (%) | 0.0% |
| Mean | 45391.35905899497 |
|---|---|
| Minimum | 0 |
| Maximum | 90882 |
| Zeros | 1 |
| Zeros (%) | < 0.1% |
| Memory size | 707.7 KiB |
Quantile statistics
| Minimum | 0 |
|---|---|
| 5-th percentile | 4529.15 |
| Q1 | 22649.75 |
| Median | 45372.5 |
| Q3 | 68106.25 |
| 95-th percentile | 86331.85 |
| Maximum | 90882 |
| Range | 90882 |
| Interquartile range (IQR) | 45456.5 |
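The quantile rows above use linear interpolation between order statistics, the default in NumPy and pandas. A small sketch (the array `x` is illustrative, not the actual column):

```python
import numpy as np

x = np.arange(91)  # illustrative stand-in for a numeric column

q05, q1, med, q3, q95 = np.quantile(x, [0.05, 0.25, 0.5, 0.75, 0.95])
stats = {
    "5-th percentile": q05,
    "Q1": q1,
    "Median": med,
    "Q3": q3,
    "95-th percentile": q95,
    "Range": int(x.max() - x.min()),
    "Interquartile range (IQR)": q3 - q1,
}
print(stats)
```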
Descriptive statistics
| Standard deviation | 26245.53906 |
|---|---|
| Coefficient of variation (CV) | 0.5782056234 |
| Kurtosis | -1.200790336 |
| Mean | 45391.35906 |
| Median Absolute Deviation (MAD) | 22728.5 |
| Skewness | 0.002910903364 |
| Sum | 4111730869 |
| Variance | 688828320.7 |
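The less common entries in this table follow directly from their definitions: CV is the standard deviation divided by the mean, and MAD is the median of absolute deviations from the median. A sketch with made-up data, assuming the report uses the sample standard deviation (ddof=1), as pandas does by default:

```python
import numpy as np

def describe(values) -> dict:
    """Recompute the descriptive statistics rows for a numeric column."""
    x = np.asarray(values, dtype=float)
    mean = x.mean()
    std = x.std(ddof=1)   # sample standard deviation (pandas default)
    return {
        "Standard deviation": std,
        "Coefficient of variation (CV)": std / mean,
        "Mean": mean,
        "Median Absolute Deviation (MAD)": float(np.median(np.abs(x - np.median(x)))),
        "Sum": float(x.sum()),
        "Variance": x.var(ddof=1),
    }

print(describe([2, 4, 4, 4, 5, 5, 7, 9]))  # illustrative data
```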
Histogram with fixed size bins (bins=10)
Common values
| Value | Count | Frequency (%) | |
|---|---|---|---|
| 2047 | 1 | < 0.1% | |
| 13011 | 1 | < 0.1% | |
| 43648 | 1 | < 0.1% | |
| 41601 | 1 | < 0.1% | |
| 47746 | 1 | < 0.1% | |
| 45699 | 1 | < 0.1% | |
| 35460 | 1 | < 0.1% | |
| 33413 | 1 | < 0.1% | |
| 39558 | 1 | < 0.1% | |
| 37511 | 1 | < 0.1% | |
| Other values (90574) | 90574 | > 99.9% |
Minimum 5 values
| Value | Count | Frequency (%) | |
|---|---|---|---|
| 0 | 1 | < 0.1% | |
| 1 | 1 | < 0.1% | |
| 2 | 1 | < 0.1% | |
| 3 | 1 | < 0.1% | |
| 4 | 1 | < 0.1% |
Maximum 5 values
| Value | Count | Frequency (%) | |
|---|---|---|---|
| 90882 | 1 | < 0.1% | |
| 90881 | 1 | < 0.1% | |
| 90880 | 1 | < 0.1% | |
| 90879 | 1 | < 0.1% | |
| 90878 | 1 | < 0.1% |
user_id
| Distinct count | 21983 |
|---|---|
| Unique (%) | 24.3% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Infinite | 0 |
| Infinite (%) | 0.0% |
| Mean | 16546.764726662546 |
|---|---|
| Minimum | -1 |
| Maximum | 55746 |
| Zeros | 0 |
| Zeros (%) | 0.0% |
| Memory size | 707.7 KiB |
Quantile statistics
| Minimum | -1 |
|---|---|
| 5-th percentile | 366.15 |
| Q1 | 3437 |
| Median | 11032 |
| Q3 | 27700 |
| 95-th percentile | 46368 |
| Maximum | 55746 |
| Range | 55747 |
| Interquartile range (IQR) | 24263 |
Descriptive statistics
| Standard deviation | 15273.36711 |
|---|---|
| Coefficient of variation (CV) | 0.9230425017 |
| Kurtosis | -0.4539287823 |
| Mean | 16546.76473 |
| Median Absolute Deviation (MAD) | 9908 |
| Skewness | 0.8247545475 |
| Sum | 1498872136 |
| Variance | 233275742.8 |
Histogram with fixed size bins (bins=10)
Common values
| Value | Count | Frequency (%) | |
|---|---|---|---|
| 805 | 1720 | 1.9% | |
| 686 | 1598 | 1.8% | |
| 919 | 1204 | 1.3% | |
| 11032 | 966 | 1.1% | |
| 7290 | 827 | 0.9% | |
| 4505 | 661 | 0.7% | |
| 183 | 493 | 0.5% | |
| 930 | 458 | 0.5% | |
| 4253 | 450 | 0.5% | |
| 3382 | 425 | 0.5% | |
| Other values (21973) | 81782 | 90.3% |
Minimum 5 values
| Value | Count | Frequency (%) | |
|---|---|---|---|
| -1 | 211 | 0.2% | |
| 5 | 117 | 0.1% | |
| 6 | 12 | < 0.1% | |
| 7 | 2 | < 0.1% | |
| 8 | 121 | 0.1% |
Maximum 5 values
| Value | Count | Frequency (%) | |
|---|---|---|---|
| 55746 | 1 | < 0.1% | |
| 55744 | 1 | < 0.1% | |
| 55742 | 1 | < 0.1% | |
| 55738 | 1 | < 0.1% | |
| 55734 | 1 | < 0.1% |
Reputation
| Distinct count | 965 |
|---|---|
| Unique (%) | 1.1% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Infinite | 0 |
| Infinite (%) | 0.0% |
| Mean | 6282.395411993288 |
|---|---|
| Minimum | 1 |
| Maximum | 87393 |
| Zeros | 0 |
| Zeros (%) | 0.0% |
| Memory size | 707.7 KiB |
Quantile statistics
| Minimum | 1 |
|---|---|
| 5-th percentile | 1 |
| Q1 | 60 |
| Median | 396 |
| Q3 | 4460 |
| 95-th percentile | 37083 |
| Maximum | 87393 |
| Range | 87392 |
| Interquartile range (IQR) | 4400 |
Descriptive statistics
| Standard deviation | 15102.26867 |
|---|---|
| Coefficient of variation (CV) | 2.403902919 |
| Kurtosis | 13.43967443 |
| Mean | 6282.395412 |
| Median Absolute Deviation (MAD) | 390 |
| Skewness | 3.574815757 |
| Sum | 569084506 |
| Variance | 228078519 |
Histogram with fixed size bins (bins=10)
Common values
| Value | Count | Frequency (%) | |
|---|---|---|---|
| 1 | 4546 | 5.0% | |
| 6 | 3196 | 3.5% | |
| 11 | 2644 | 2.9% | |
| 65272 | 1720 | 1.9% | |
| 44152 | 1598 | 1.8% | |
| 16 | 1369 | 1.5% | |
| 87393 | 1204 | 1.3% | |
| 21 | 1172 | 1.3% | |
| 22275 | 966 | 1.1% | |
| 37083 | 827 | 0.9% | |
| Other values (955) | 71342 | 78.8% |
Minimum 5 values
| Value | Count | Frequency (%) | |
|---|---|---|---|
| 1 | 4546 | 5.0% | |
| 2 | 12 | < 0.1% | |
| 3 | 448 | 0.5% | |
| 4 | 123 | 0.1% | |
| 5 | 26 | < 0.1% |
Maximum 5 values
| Value | Count | Frequency (%) | |
|---|---|---|---|
| 87393 | 1204 | 1.3% | |
| 65272 | 1720 | 1.9% | |
| 44152 | 1598 | 1.8% | |
| 37083 | 827 | 0.9% | |
| 31170 | 458 | 0.5% |
Views
| Distinct count | 361 |
|---|---|
| Unique (%) | 0.4% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Infinite | 0 |
| Infinite (%) | 0.0% |
| Mean | 1034.2451757484766 |
|---|---|
| Minimum | 0 |
| Maximum | 20932 |
| Zeros | 7040 |
| Zeros (%) | 7.8% |
| Memory size | 707.7 KiB |
Quantile statistics
| Minimum | 0 |
|---|---|
| 5-th percentile | 0 |
| Q1 | 5 |
| Median | 45 |
| Q3 | 514.25 |
| 95-th percentile | 5680 |
| Maximum | 20932 |
| Range | 20932 |
| Interquartile range (IQR) | 509.25 |
Descriptive statistics
| Standard deviation | 2880.074012 |
|---|---|
| Coefficient of variation (CV) | 2.784711091 |
| Kurtosis | 28.31236715 |
| Mean | 1034.245176 |
| Median Absolute Deviation (MAD) | 44 |
| Skewness | 4.873839918 |
| Sum | 93686065 |
| Variance | 8294826.315 |
Histogram with fixed size bins (bins=10)
Common values
| Value | Count | Frequency (%) | |
|---|---|---|---|
| 0 | 7040 | 7.8% | |
| 1 | 5209 | 5.8% | |
| 2 | 3819 | 4.2% | |
| 3 | 2889 | 3.2% | |
| 4 | 2330 | 2.6% | |
| 5 | 1817 | 2.0% | |
| 6 | 1726 | 1.9% | |
| 5680 | 1720 | 1.9% | |
| 7357 | 1598 | 1.8% | |
| 7 | 1533 | 1.7% | |
| Other values (351) | 60903 | 67.2% |
Minimum 5 values
| Value | Count | Frequency (%) | |
|---|---|---|---|
| 0 | 7040 | 7.8% | |
| 1 | 5209 | 5.8% | |
| 2 | 3819 | 4.2% | |
| 3 | 2889 | 3.2% | |
| 4 | 2330 | 2.6% |
Maximum 5 values
| Value | Count | Frequency (%) | |
|---|---|---|---|
| 20932 | 1204 | 1.3% | |
| 7395 | 966 | 1.1% | |
| 7357 | 1598 | 1.8% | |
| 6948 | 450 | 0.5% | |
| 5927 | 266 | 0.3% |
UpVotes
| Distinct count | 330 |
|---|---|
| Unique (%) | 0.4% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Infinite | 0 |
| Infinite (%) | 0.0% |
| Mean | 734.3157180075951 |
|---|---|
| Minimum | 0 |
| Maximum | 11442 |
| Zeros | 22186 |
| Zeros (%) | 24.5% |
| Memory size | 707.7 KiB |
Quantile statistics
| Minimum | 0 |
|---|---|
| 5-th percentile | 0 |
| Q1 | 1 |
| Median | 22 |
| Q3 | 283 |
| 95-th percentile | 5007 |
| Maximum | 11442 |
| Range | 11442 |
| Interquartile range (IQR) | 282 |
Descriptive statistics
| Standard deviation | 2050.869327 |
|---|---|
| Coefficient of variation (CV) | 2.792898581 |
| Kurtosis | 14.31300098 |
| Mean | 734.315718 |
| Median Absolute Deviation (MAD) | 22 |
| Skewness | 3.790593781 |
| Sum | 66517255 |
| Variance | 4206064.997 |
Histogram with fixed size bins (bins=10)
Common values
| Value | Count | Frequency (%) | |
|---|---|---|---|
| 0 | 22186 | 24.5% | |
| 1 | 3681 | 4.1% | |
| 2 | 2887 | 3.2% | |
| 3 | 1952 | 2.2% | |
| 7035 | 1720 | 1.9% | |
| 4 | 1705 | 1.9% | |
| 2156 | 1598 | 1.8% | |
| 6 | 1583 | 1.7% | |
| 5 | 1315 | 1.5% | |
| 7 | 1220 | 1.3% | |
| Other values (320) | 50737 | 56.0% |
Minimum 5 values
| Value | Count | Frequency (%) | |
|---|---|---|---|
| 0 | 22186 | 24.5% | |
| 1 | 3681 | 4.1% | |
| 2 | 2887 | 3.2% | |
| 3 | 1952 | 2.2% | |
| 4 | 1705 | 1.9% |
Maximum 5 values
| Value | Count | Frequency (%) | |
|---|---|---|---|
| 11442 | 187 | 0.2% | |
| 11273 | 1204 | 1.3% | |
| 10523 | 458 | 0.5% | |
| 8641 | 827 | 0.9% | |
| 7035 | 1720 | 1.9% |
DownVotes
| Distinct count | 76 |
|---|---|
| Unique (%) | 0.1% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Infinite | 0 |
| Infinite (%) | 0.0% |
| Mean | 33.27324913892078 |
|---|---|
| Minimum | 0 |
| Maximum | 1920 |
| Zeros | 50205 |
| Zeros (%) | 55.4% |
| Memory size | 707.7 KiB |
Quantile statistics
| Minimum | 0 |
|---|---|
| 5-th percentile | 0 |
| Q1 | 0 |
| Median | 0 |
| Q3 | 8 |
| 95-th percentile | 143 |
| Maximum | 1920 |
| Range | 1920 |
| Interquartile range (IQR) | 8 |
Descriptive statistics
| Standard deviation | 134.9364354 |
|---|---|
| Coefficient of variation (CV) | 4.05540303 |
| Kurtosis | 98.77457755 |
| Mean | 33.27324914 |
| Median Absolute Deviation (MAD) | 0 |
| Skewness | 8.757213426 |
| Sum | 3014024 |
| Variance | 18207.84159 |
Histogram with fixed size bins (bins=10)
Common values
| Value | Count | Frequency (%) | |
|---|---|---|---|
| 0 | 50205 | 55.4% | |
| 1 | 4628 | 5.1% | |
| 2 | 3250 | 3.6% | |
| 4 | 2226 | 2.5% | |
| 143 | 1989 | 2.2% | |
| 3 | 1905 | 2.1% | |
| 6 | 1855 | 2.0% | |
| 5 | 1846 | 2.0% | |
| 82 | 1598 | 1.8% | |
| 8 | 1562 | 1.7% | |
| Other values (66) | 19520 | 21.5% |
Minimum 5 values
| Value | Count | Frequency (%) | |
|---|---|---|---|
| 0 | 50205 | 55.4% | |
| 1 | 4628 | 5.1% | |
| 2 | 3250 | 3.6% | |
| 3 | 1905 | 2.1% | |
| 4 | 2226 | 2.5% |
Maximum 5 values
| Value | Count | Frequency (%) | |
|---|---|---|---|
| 1920 | 211 | 0.2% | |
| 779 | 1204 | 1.3% | |
| 412 | 266 | 0.3% | |
| 351 | 291 | 0.3% | |
| 214 | 458 | 0.5% |
post_id
| Distinct count | 90584 |
|---|---|
| Unique (%) | 100.0% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Infinite | 0 |
| Infinite (%) | 0.0% |
| Mean | 56539.08052194648 |
|---|---|
| Minimum | 1 |
| Maximum | 115378 |
| Zeros | 0 |
| Zeros (%) | 0.0% |
| Memory size | 707.7 KiB |
Quantile statistics
| Minimum | 1 |
|---|---|
| 5-th percentile | 5315.15 |
| Q1 | 26051.75 |
| Median | 57225.5 |
| Q3 | 86145.25 |
| 95-th percentile | 110267.85 |
| Maximum | 115378 |
| Range | 115377 |
| Interquartile range (IQR) | 60093.5 |
Descriptive statistics
| Standard deviation | 33840.30753 |
|---|---|
| Coefficient of variation (CV) | 0.5985294988 |
| Kurtosis | -1.231769907 |
| Mean | 56539.08052 |
| Median Absolute Deviation (MAD) | 30031 |
| Skewness | 0.03591388347 |
| Sum | 5121536070 |
| Variance | 1145166414 |
Histogram with fixed size bins (bins=10)
Common values
| Value | Count | Frequency (%) | |
|---|---|---|---|
| 4094 | 1 | < 0.1% | |
| 29403 | 1 | < 0.1% | |
| 51852 | 1 | < 0.1% | |
| 49805 | 1 | < 0.1% | |
| 55950 | 1 | < 0.1% | |
| 8849 | 1 | < 0.1% | |
| 14994 | 1 | < 0.1% | |
| 12947 | 1 | < 0.1% | |
| 2708 | 1 | < 0.1% | |
| 661 | 1 | < 0.1% | |
| Other values (90574) | 90574 | > 99.9% |
Minimum 5 values
| Value | Count | Frequency (%) | |
|---|---|---|---|
| 1 | 1 | < 0.1% | |
| 2 | 1 | < 0.1% | |
| 3 | 1 | < 0.1% | |
| 4 | 1 | < 0.1% | |
| 5 | 1 | < 0.1% |
Maximum 5 values
| Value | Count | Frequency (%) | |
|---|---|---|---|
| 115378 | 1 | < 0.1% | |
| 115377 | 1 | < 0.1% | |
| 115376 | 1 | < 0.1% | |
| 115375 | 1 | < 0.1% | |
| 115374 | 1 | < 0.1% |
Score
| Distinct count | 128 |
|---|---|
| Unique (%) | 0.1% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Infinite | 0 |
| Infinite (%) | 0.0% |
| Mean | 2.7807670228737966 |
|---|---|
| Minimum | -19 |
| Maximum | 192 |
| Zeros | 19927 |
| Zeros (%) | 22.0% |
| Memory size | 707.7 KiB |
Quantile statistics
| Minimum | -19 |
|---|---|
| 5-th percentile | 0 |
| Q1 | 1 |
| Median | 2 |
| Q3 | 3 |
| 95-th percentile | 9 |
| Maximum | 192 |
| Range | 211 |
| Interquartile range (IQR) | 2 |
Descriptive statistics
| Standard deviation | 4.948921899 |
|---|---|
| Coefficient of variation (CV) | 1.779696702 |
| Kurtosis | 192.5905972 |
| Mean | 2.780767023 |
| Median Absolute Deviation (MAD) | 1 |
| Skewness | 9.827873481 |
| Sum | 251893 |
| Variance | 24.49182796 |
Histogram with fixed size bins (bins=10)
Common values
| Value | Count | Frequency (%) | |
|---|---|---|---|
| 1 | 22901 | 25.3% | |
| 0 | 19927 | 22.0% | |
| 2 | 15248 | 16.8% | |
| 3 | 9909 | 10.9% | |
| 4 | 6210 | 6.9% | |
| 5 | 4142 | 4.6% | |
| 6 | 2849 | 3.1% | |
| 7 | 1941 | 2.1% | |
| 8 | 1305 | 1.4% | |
| 9 | 950 | 1.0% | |
| Other values (118) | 5202 | 5.7% |
Minimum 5 values
| Value | Count | Frequency (%) | |
|---|---|---|---|
| -19 | 2 | < 0.1% | |
| -13 | 1 | < 0.1% | |
| -10 | 1 | < 0.1% | |
| -9 | 2 | < 0.1% | |
| -8 | 2 | < 0.1% |
Maximum 5 values
| Value | Count | Frequency (%) | |
|---|---|---|---|
| 192 | 1 | < 0.1% | |
| 184 | 1 | < 0.1% | |
| 164 | 1 | < 0.1% | |
| 156 | 1 | < 0.1% | |
| 152 | 1 | < 0.1% |
ViewCount
| Distinct count | 3654 |
|---|---|
| Unique (%) | 8.7% |
| Missing | 48396 |
| Missing (%) | 53.4% |
| Infinite | 0 |
| Infinite (%) | 0.0% |
| Mean | 556.6561581492367 |
|---|---|
| Minimum | 1.0 |
| Maximum | 175495.0 |
| Zeros | 0 |
| Zeros (%) | 0.0% |
| Memory size | 707.7 KiB |
Quantile statistics
| Minimum | 1 |
|---|---|
| 5-th percentile | 19 |
| Q1 | 53 |
| Median | 126 |
| Q3 | 367 |
| 95-th percentile | 2107.95 |
| Maximum | 175495 |
| Range | 175494 |
| Interquartile range (IQR) | 314 |
Descriptive statistics
| Standard deviation | 2356.930779 |
|---|---|
| Coefficient of variation (CV) | 4.234087317 |
| Kurtosis | 1135.338873 |
| Mean | 556.6561581 |
| Median Absolute Deviation (MAD) | 91 |
| Skewness | 24.95697592 |
| Sum | 23484210 |
| Variance | 5555122.698 |
Histogram with fixed size bins (bins=10)
Common values
| Value | Count | Frequency (%) | |
|---|---|---|---|
| 38 | 295 | 0.3% | |
| 31 | 293 | 0.3% | |
| 37 | 277 | 0.3% | |
| 27 | 277 | 0.3% | |
| 24 | 274 | 0.3% | |
| 36 | 272 | 0.3% | |
| 30 | 270 | 0.3% | |
| 33 | 262 | 0.3% | |
| 25 | 262 | 0.3% | |
| 32 | 261 | 0.3% | |
| Other values (3644) | 39445 | 43.5% | |
| (Missing) | 48396 | 53.4% |
Minimum 5 values
| Value | Count | Frequency (%) | |
|---|---|---|---|
| 1 | 1 | < 0.1% | |
| 2 | 5 | < 0.1% | |
| 3 | 6 | < 0.1% | |
| 4 | 20 | < 0.1% | |
| 5 | 33 | < 0.1% |
Maximum 5 values
| Value | Count | Frequency (%) | |
|---|---|---|---|
| 175495 | 1 | < 0.1% | |
| 98109 | 1 | < 0.1% | |
| 92612 | 1 | < 0.1% | |
| 91848 | 1 | < 0.1% | |
| 88129 | 1 | < 0.1% |
CommentCount
| Distinct count | 39 |
|---|---|
| Unique (%) | < 0.1% |
| Missing | 0 |
| Missing (%) | 0.0% |
| Infinite | 0 |
| Infinite (%) | 0.0% |
| Mean | 1.894650269363243 |
|---|---|
| Minimum | 0 |
| Maximum | 45 |
| Zeros | 38051 |
| Zeros (%) | 42.0% |
| Memory size | 707.7 KiB |
Quantile statistics
| Minimum | 0 |
|---|---|
| 5-th percentile | 0 |
| Q1 | 0 |
| Median | 1 |
| Q3 | 3 |
| 95-th percentile | 7 |
| Maximum | 45 |
| Range | 45 |
| Interquartile range (IQR) | 3 |
Descriptive statistics
| Standard deviation | 2.638704141 |
|---|---|
| Coefficient of variation (CV) | 1.392713042 |
| Kurtosis | 12.44510758 |
| Mean | 1.894650269 |
| Median Absolute Deviation (MAD) | 1 |
| Skewness | 2.574211733 |
| Sum | 171625 |
| Variance | 6.962759541 |
Histogram with fixed size bins (bins=10)
Common values
| Value | Count | Frequency (%) | |
|---|---|---|---|
| 0 | 38051 | 42.0% | |
| 1 | 14798 | 16.3% | |
| 2 | 12527 | 13.8% | |
| 3 | 7835 | 8.6% | |
| 4 | 5560 | 6.1% | |
| 5 | 3651 | 4.0% | |
| 6 | 2601 | 2.9% | |
| 7 | 1701 | 1.9% | |
| 8 | 1198 | 1.3% | |
| 9 | 835 | 0.9% | |
| Other values (29) | 1827 | 2.0% |
Minimum 5 values
| Value | Count | Frequency (%) | |
|---|---|---|---|
| 0 | 38051 | 42.0% | |
| 1 | 14798 | 16.3% | |
| 2 | 12527 | 13.8% | |
| 3 | 7835 | 8.6% | |
| 4 | 5560 | 6.1% |
Maximum 5 values
| Value | Count | Frequency (%) | |
|---|---|---|---|
| 45 | 1 | < 0.1% | |
| 41 | 2 | < 0.1% | |
| 37 | 2 | < 0.1% | |
| 35 | 2 | < 0.1% | |
| 34 | 1 | < 0.1% |
Body
| Distinct count | 90350 |
|---|---|
| Unique (%) | > 99.9% |
| Missing | 220 |
| Missing (%) | 0.2% |
| Memory size | 707.7 KiB |
Common values
| Value | Count | Frequency (%) | |
|---|---|---|---|
| <p>So I am developing this application for rating books (think like IMDB for books) using relational database. </p> <p><strong>Problem statement :</strong></p> <p>Let's say book "<strong>A</strong>" deserves 8.5 in absolute sense. In case if A is the best book I have ever seen, I'll most probably rate it > 9.5 whereas for someone else, it might be just an average book, so he/they will rate it less (say around 8). Let's assume 4 such guys rate it 8.</p> <p>If there are 10 guys who are like me (who haven't ever read great literature) and they all rate it 9.5-10. This will effectively make it's cumulative rating greater than 9 (9.5*10 + 8*4) / 14 = 9.1</p> <p>whereas we needed the result to be 8.5 ... How can I take care of(normalize) this bias due to incorrect perception of individuals.</p> <p><strong>MyProposedSolution :</strong></p> <p>Here's one of the ways how I think it could be solved. We can have a variable <strong>Lit_coefficient</strong> which tells us how much knowledge a user has about literature. If I rate "<strong>A</strong>"(the book) 9.5 and person "<strong>X</strong>" rates it 8, then he must have read books much better than "<strong>A</strong>" and thus his Lit_coefficient should be higher. And then we can normalize the ratings according to the Lit_coefficient of user. Could there be a better algorithm/solution for the same?</p> | 2 | < 0.1% | |
| <p><a href="http://en.wikipedia.org/wiki/Proportional_hazards_models" rel="nofollow">Cox proportional hazards regression</a> is a very popular, semi-parametric method for survival analysis. </p> <p>It is semi-parametric in that the baseline hazard is left unspecified, but parameters for the effects of covariates are estimated. Eliminating the possibility of misspecifying the baseline makes the beta estimates more robust.</p> <p><em>Proportional hazards</em> means that no matter what the baseline hazard may be at any point in time, the ceteris paribus effect of a one-unit increase in a covariate is a constant multiple of the baseline hazard. </p> | 2 | < 0.1% | |
| <p>In a MCMC implementation of hierarchical models, with normal random effects and a Wishart prior for their covariance matrix, Gibbs sampling is typically used.</p> <p>However, if we change the distribution of the random effects (e.g., to Student's-t or another one), the conjugacy is lost. In this case, what would be a suitable (i.e., easily tunable) proposal distribution for the covariance matrix of the random effects in a Metropolis-Hastings algorithm, and what should be the target acceptance rate, again 0.234?</p> <p>Thanks in advance for any pointers.</p> | 2 | < 0.1% | |
| <p>So I'm looking to compare different combinations of features and classifiers. But I'm getting a lot of combinations that achieve 100% cross validation accuracy. I'm trying to figure out how I would compare the usefulness of each combination.</p> <p>For example I can both train an SVM using Features 1, 10, 15 to get 100% accuracy. But at the same time I can train a logistic regression classifier only using Feature 7 to get 100% accuracy. Also this is a binary classification problem.</p> | 2 | < 0.1% | |
| <p>Actually, <strong>frequent itemset mining</strong> may be a better choice than clustering on such data.</p> <p>The usual vector-oriented set of algorithms does not make a lot of sense. K-means for example will produce means that are no longer binary.</p> | 2 | < 0.1% | |
| <p>I understand that fuzzy clustering using FCM produces a membership matrix for the set of data points we feed to it. What characteristics will an anomalous cluster produced during this method have? (Considering I only have unlabelled data)</p> | 2 | < 0.1% | |
| Hidden Markov Models are used for modelling systems that are assumed to be Markov processes with hidden (i.e. unobserved) states. | 2 | < 0.1% | |
| <p><a href="http://www.math.umass.edu/~lavine/Book/book.html">Introduction to Statistical Thought</a></p> | 2 | < 0.1% | |
| <p>I'm trying to improve a factory quality control.</p> <p>I have some variables from the melting process (something like ten control variables) that changes trough time (a matrix of the values of those control per minute), and in the end I have a quality score for the final product (one single variable). I have one of this for each production batch.</p> <p>I want to know if you guys can help me saying how can I look for correlations with the quality score inside that matrix.</p> <p>I know that I can look to each control variable alone, but those variables interfere with each other. So it is necessary to look at the sistem as a one.</p> <p>Thank you.</p> | 2 | < 0.1% | |
| <p>Given a i.i.d sample $X_{1},..,X_{n}$ of bernoulli random variables test 2 hypotheses $H_{0}:p=2/3$ and $H_{1}:p=1/3$. Bayesian prior is $\\pi(2/3)=1/3$ and $\\pi(1/3)=2/3$. Find the bayesian criterion for acceptng $H_{0}$, find the bayesian mean square error for the test and for $n=8$ compute this mean square error using normal approximation</p> <p>I have found the bayesian criterion for acceptance as $\\sum_{i=1}^{n}x_{i}{\\geq}\\frac{n+1-log_{2}(\\alpha^{-1}-1)}{2}$. where $\\alpha$ is a value is chosen prior to the test. How do you do the other two parts? </p> <p>Thanks</p> | 2 | < 0.1% | |
| Other values (90340) | 90344 | 99.7% | |
| (Missing) | 220 | 0.2% |
Length
| Max length | 38847 |
|---|---|
| Median length | 815 |
| Mean length | 1128.987846 |
| Min length | 3 |
Pearson's r
Pearson's correlation coefficient (r) is a measure of linear correlation between two variables. Its value lies between -1 and +1, with -1 indicating total negative linear correlation, 0 no linear correlation, and +1 total positive linear correlation. Furthermore, r is invariant under separate changes in location and scale of the two variables, implying that for a linear function the angle to the x-axis does not affect r. To calculate r for two variables X and Y, one divides the covariance of X and Y by the product of their standard deviations.
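The definition in the last sentence translates directly to code; a minimal sketch with illustrative data (not from this report):

```python
import numpy as np

def pearson_r(x, y):
    """r = cov(X, Y) / (sigma_X * sigma_Y), exactly as defined above."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    # Population covariance and population standard deviations (ddof=0 throughout).
    return np.cov(x, y, ddof=0)[0, 1] / (x.std() * y.std())

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 3.0 * x + 2.0           # perfect positive linear relation
print(pearson_r(x, y))       # 1 up to floating point
```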
Spearman's ρ
Spearman's rank correlation coefficient (ρ) is a measure of monotonic correlation between two variables, and is therefore better at catching nonlinear monotonic correlations than Pearson's r. Its value lies between -1 and +1, with -1 indicating total negative monotonic correlation, 0 no monotonic correlation, and +1 total positive monotonic correlation. To calculate ρ for two variables X and Y, one divides the covariance of the rank variables of X and Y by the product of their standard deviations.
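A sketch of the rank-then-correlate recipe above, valid for tie-free data (ties would need averaged ranks):

```python
import numpy as np

def spearman_rho(x, y):
    """rho = Pearson's r applied to the rank-transformed variables (no ties)."""
    # argsort of argsort yields 0-based ranks when all values are distinct.
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    return np.cov(rx, ry, ddof=0)[0, 1] / (rx.std() * ry.std())

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = x ** 3                   # monotonic but non-linear
print(spearman_rho(x, y))    # a monotonic increase gives rho = 1
```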
Kendall's τ
Similarly to Spearman's rank correlation coefficient, the Kendall rank correlation coefficient (τ) measures ordinal association between two variables. Its value lies between -1 and +1, with -1 indicating total negative correlation, 0 no correlation, and +1 total positive correlation. To calculate τ for two variables X and Y, one determines the numbers of concordant and discordant pairs of observations. τ is given by the number of concordant pairs minus the number of discordant pairs, divided by the total number of pairs.
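The concordant/discordant pair count above can be sketched as follows for tie-free data (with ties, variants such as τ-b adjust the denominator):

```python
from itertools import combinations

def kendall_tau(x, y):
    """tau = (concordant - discordant) / total pairs, for tie-free data."""
    pairs = list(combinations(range(len(x)), 2))
    concordant = sum(1 for i, j in pairs if (x[i] - x[j]) * (y[i] - y[j]) > 0)
    discordant = sum(1 for i, j in pairs if (x[i] - x[j]) * (y[i] - y[j]) < 0)
    return (concordant - discordant) / len(pairs)

x = [1, 2, 3, 4, 5]
y = [1, 3, 2, 4, 5]          # one swapped pair out of ten
print(kendall_tau(x, y))      # (9 - 1) / 10 = 0.8
```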
Phik (φk)
Phik (φk) is a newer, practical correlation coefficient that works consistently between categorical, ordinal and interval variables, captures non-linear dependency, and reverts to the Pearson correlation coefficient in the case of a bivariate normal input distribution.

First rows
| | df_index | user_id | Reputation | Views | UpVotes | DownVotes | post_id | Score | ViewCount | CommentCount | Body |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | -1 | 1 | 0 | 5007 | 1920 | 2175 | 0 | NaN | 0 | <p><strong>CrossValidated</strong> is for statisticians, data miners, and anyone else doing data analysis or interested in it as a discipline. If you have a question about</p>\n\n<ul>\n<li><strong>statistical analysis</strong>, applied or theoretical</li>\n<li><strong>designing experiments</strong></li>\n<li><strong>collecting data</strong></li>\n<li><strong>data mining</strong></li>\n<li><strong>machine learning</strong></li>\n<li><strong>visualizing data</strong></li>\n<li><strong>probability theory</strong></li>\n<li><strong>mathematical statistics</strong></li>\n<li>statistical and data-driven <strong>computing</strong></li>\n</ul>\n\n<p>then you're in the right place. Anybody can ask a question, regardless of skills and experience, but some questions are still better than others. If you came here with a question to ask and are new to the site, please consult our thread on <a href="http://meta.stats.stackexchange.com/questions/1479/how-to-ask-a-good-question-on-crossvalidated">how to ask a good question</a>.</p>\n\n<p>Our community aims to create a lasting record of great solutions to questions. For more about this and guidance about how to provide your own great answers, please read <a href="http://meta.stats.stackexchange.com/questions/1390/how-should-questions-be-answered-on-cross-validated">How should questions be answered on Cross Validated?</a>. Providing references to peer-reviewed literature or links to on-line resources is warmly welcomed. You can also incorporate the work of others under <a href="http://en.wikipedia.org/wiki/Fair_use" rel="nofollow">fair use doctrine</a>, which particularly means that you <em>must</em> attribute any text, images, or other material that is not originally yours.</p>\n\n<p><strong>Homework</strong> questions are welcome. <em>Please mark them with the <a href="http://stats.stackexchange.com/questions/tagged/homework">homework</a> tag</em>. They get <a href="http://meta.stackexchange.com/questions/10811/how-to-ask-and-answer-homework-questions/10812#10812">somewhat special treatment</a>, because ultimately you benefit most by finding the solution <em>yourself.</em> The community will try to provide <a href="http://meta.stats.stackexchange.com/q/12/919">guidance, hints, and useful links</a>.</p>\n\n<p><em>There are certain subjects that will probably get better responses on our sister sites</em>. If your question is about</p>\n\n<ul>\n<li><strong>Programming</strong>, ask on <a href="http://stackoverflow.com">Stack Overflow</a>. If the language is statistically oriented (such as <strong>R</strong>, <strong>SAS</strong>, <strong>Stata</strong>, <strong>SPSS</strong>, etc.), then decide based on the nature of your question: if it needs <em>statistical expertise</em> to understand or answer, ask it here; if it's about an <em>algorithm</em>, routine <em>data processing</em>, or details of the <em>language</em>, then please refer to the <a href="http://meta.stats.stackexchange.com/questions/793/internet-support-for-statistics-software">collection of links to resources</a> we maintain.</li>\n<li><strong>Mathematics</strong>, ask on <a href="http://math.stackexchange.com">math.stackexchange.com</a>.</li>\n<li><strong>Bugs in software</strong>, ask the people who produced the software.</li>\n</ul>\n\n<p>Questions about <strong>obtaining particular datasets</strong> are off-topic (they are too specialized). The <a href="http://gis.stackexchange.com">GIS site</a> welcomes inquiries about obtaining geographically related datasets.</p>\n\n<p>Please note, however, that <em>cross-posting is not encouraged</em> on SE sites. Choose one best location to post your question. Later, if it proves better suited on another site, it can be <em>migrated</em>.</p>\n |
| 1 | 1 | -1 | 1 | 0 | 5007 | 1920 | 8576 | 0 | NaN | 0 | NaN |
| 2 | 2 | -1 | 1 | 0 | 5007 | 1920 | 8578 | 0 | NaN | 0 | NaN |
| 3 | 3 | -1 | 1 | 0 | 5007 | 1920 | 8981 | 0 | NaN | 0 | <p>"Statistics" can refer variously to the (wide) field of statistical theory and statistical analysis; to constructing functions of data as used in formal procedures; to collections of data; and to summaries of data.</p>\n\n<p>Because this site is about statistics and statistical analysis, it is rare that tagging a question with "statistics" will be informative. Use of this tag will signal that your question is extremely general and broad.</p>\n |
| 4 | 4 | -1 | 1 | 0 | 5007 | 1920 | 8982 | 0 | NaN | 0 | This generic tag is only rarely suitable; use it with caution. Consider selecting more specific, descriptive tags. |
| 5 | 5 | -1 | 1 | 0 | 5007 | 1920 | 9857 | 0 | NaN | 0 | NaN |
| 6 | 6 | -1 | 1 | 0 | 5007 | 1920 | 9858 | 0 | NaN | 0 | Linear regression is a type of regression when regression function is linear. It is most widely used regression type. |
| 7 | 7 | -1 | 1 | 0 | 5007 | 1920 | 9860 | 0 | NaN | 0 | NaN |
| 8 | 8 | -1 | 1 | 0 | 5007 | 1920 | 10130 | 0 | NaN | 0 | NaN |
| 9 | 9 | -1 | 1 | 0 | 5007 | 1920 | 10131 | 0 | NaN | 0 | NaN |
Last rows
| df_index | user_id | Reputation | Views | UpVotes | DownVotes | post_id | Score | ViewCount | CommentCount | Body | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 90574 | 90873 | 55724 | 16 | 1 | 0 | 0 | 115335 | 3 | 19.0 | 1 | <p>I have a set of objects, each of which can be assigned to another object within the set in a many-to-one, directional assignment, like a vote. Objects cannot be assigned to themselves reflexively, but they can be in an unassigned state ("not voting"). So for example, the set <code>{A, ..., Z}</code> plus relation could be in the following state:</p>\n\n<pre><code>A -> E\nE -> C\nB -> C\nC -> B\nD -> Y\n{F, ..., Z} are not voting.\n</code></pre>\n\n<p>This state will change over time, in discrete steps, one vote at a time. For example, at time <code>t = 0</code> the set could be as above. Then at <code>t = 1</code>, <code>Z</code> votes for <code>C</code> (<code>Z -> C</code>), resulting in the state:</p>\n\n<pre><code>A -> E\nE -> C\nB -> C\nC -> B\nD -> Y\nZ -> C\n{F, ..., Y} are unassigned.\n</code></pre>\n\n<p>Can anyone come up with a neat way to illustrate graphically this set and the state changes over time? Ideally, it should be evident at a glance</p>\n\n<ul>\n<li>How many votes there are for one object at a given time.</li>\n<li>Which objects are voting for another given object.</li>\n</ul>\n\n<p>Thanks!</p>\n |
| 90575 | 90874 | 55729 | 16 | 1 | 0 | 0 | 115338 | 0 | NaN | 0 | <p>Not all items are solved fully correctly in the question. I would recommend the following.</p>\n\n<p>(0) The observations "y" do not need to be corrected as they are between 0 and 1 already. Applying the correction shouldn't create problems but it's not necessary either.</p>\n\n<p>(1) cannot be answered by the likelihood ratio (LR) test. Generally in mixture models, the selection of the number of components cannot be based on the LR test because its regularity assumptions are not fulfilled. Instead, information criteria are often used, and "flexmix", upon which betamix() is based, offers AIC, BIC, and ICL. So you could choose the best BIC solution among 1, 2, 3 clusters via</p>\n\n<pre><code>library("flexmix")\nset.seed(0)\nm <- betamix(y ~ 1 | 1, data = d, k = 1:3)\n</code></pre>\n\n<p>(2) The parameters in betamix() are not mu and phi directly; additionally, link functions are employed for both parameters. The defaults are logit and log, respectively. This ensures that the parameters are in their valid ranges (0, 1) and (0, inf), respectively. One could refit the models in both components to get easier access to the links and inverse links, etc. However, here it is probably easiest to apply the inverse links by hand:</p>\n\n<pre><code>mu <- plogis(coef(m)[,1])\nphi <- exp(coef(m)[,2])\n</code></pre>\n\n<p>This shows that the means are very different (0.25 and 0.77) while the precisions are rather similar (49.4 and 47.8). Then we can transform back to alpha and beta, which gives 12.4, 37.0 and 36.7, 11.1, reasonably close to the original parameters in the simulation:</p>\n\n<pre><code>a <- mu * phi\nb <- (1 - mu) * phi\n</code></pre>\n\n<p>(3) The clusters can be extracted using the clusters() function. This simply selects the component with the highest posterior() probability. In this case, the posterior() is really clear-cut, i.e., either close to zero or close to 1.</p>\n\n<pre><code>cl <- clusters(m)\n</code></pre>\n\n<p>(4) When visualizing the data with histograms, one can either visualize both components separately, i.e., each with its own density function, or draw one joint histogram with the corresponding joint density. The difference is that the latter needs to factor in the different cluster sizes: the prior weights are about 1/3 and 2/3 here. The separate histograms can be drawn like this:</p>\n\n<pre><code>## separate histograms for both clusters\nhist(subset(d, cl == 1)$y, breaks = 0:25/25, freq = FALSE,\n col = hcl(0, 50, 80), main = "", xlab = "y", ylim = c(0, 9))\n\nhist(subset(d, cl == 2)$y, breaks = 0:25/25, freq = FALSE,\n col = hcl(240, 50, 80), main = "", xlab = "y", ylim = c(0, 9), add = TRUE)\n\n## lines for fitted densities\nys <- seq(0, 1, by = 0.01)\nlines(ys, dbeta(ys, shape1 = a[1], shape2 = b[1]),\n col = hcl(0, 80, 50), lwd = 2)\nlines(ys, dbeta(ys, shape1 = a[2], shape2 = b[2]),\n col = hcl(240, 80, 50), lwd = 2)\n\n## lines for corresponding means\nabline(v = mu[1], col = hcl(0, 80, 50), lty = 2, lwd = 2)\nabline(v = mu[2], col = hcl(240, 80, 50), lty = 2, lwd = 2)\n</code></pre>\n\n<p>And the joint histogram:</p>\n\n<pre><code>p <- prior(m$flexmix)\nhist(d$y, breaks = 0:25/25, freq = FALSE,\n main = "", xlab = "y", ylim = c(0, 4.5))\nlines(ys, p[1] * dbeta(ys, shape1 = a[1], shape2 = b[1]) +\n p[2] * dbeta(ys, shape1 = a[2], shape2 = b[2]), lwd = 2)\n</code></pre>\n\n<p>The resulting figure is included below.</p>\n\n<p><img src="http://i.stack.imgur.com/7dcW4.png" alt="enter image description here"></p>\n |
| 90576 | 90875 | 55730 | 1 | 0 | 0 | 0 | 115340 | 0 | 18.0 | 1 | <p>I am working on decision trees for the first time at my job. I have done a lot of research on the CHAID and CART algorithms but find different answers to a very simple question, given below:</p>\n\n<p><strong>What kind of target variables can CART have?</strong></p>\n\n<p>I understand that CART can help with both prediction and classification. I know that the target variable for a regression tree is continuous. I am getting different answers in various research papers with regard to the target variable of a classification tree. Can someone please help me with this?</p>\n\n<p>Further, I need to do some analysis on the following:</p>\n\n<p>1) Identifying frauds: I intend to use classification trees of CART/CHAID or logistic regression\n2) Forecasting losses: I intend to use linear regression or regression trees of CART\n3) Identifying cheque bounce customers: I intend to use classification trees of CART/CHAID or logistic regression</p>\n\n<p>Kindly suggest if this is the right way to go.</p>\n\n<p>Thanks,\nquants_mum</p>\n |
| 90577 | 90876 | 55731 | 1 | 0 | 0 | 0 | 115350 | 0 | 3.0 | 0 | <p>How do we specify negative costs in rpart? The documentation says the diagonals of the loss matrix should be zero. Is there an alternative to specify the benefits of correct classification (that is, the negative cost)?</p>\n |
| 90578 | 90877 | 55733 | 6 | 1 | 0 | 0 | 115356 | 1 | 15.0 | 0 | <p>I'm a beginner in statistics and I have to run multilevel logistic regressions. I am confused by the results, as they differ from those of a logistic regression with just one level. </p>\n\n<p>I don't know how to interpret the variance and correlation of the random variables, and I wonder how to compute the ICC.</p>\n\n<p>For example: I have a dependent variable about the protection friendship ties give to individuals (1 is for individuals who can rely a lot on their friends, 0 is for the others). There are 50 geographic clusters of respondents and one random variable, which is a factor describing the social situation of the neighborhood. Upper/middle class is the reference; the other levels are working-class and underprivileged neighborhoods. </p>\n\n<p>I get these results:</p>\n\n<pre><code>> summary(RLM3)\nGeneralized linear mixed model fit by maximum likelihood (Laplace Approximation) ['glmerMod']\n Family: binomial ( logit )\nFormula: Arp ~ Densite2 + Sexe + Age + Etudes + pcs1 + Enfants + Origine3 + Sante + Religion + LPO + Sexe * Enfants + Rev + (1 + Strate | \n Quartier)\n Data: LPE\nWeights: PONDERATION\nControl: glmerControl(optimizer = "bobyqa")\n\n AIC BIC logLik deviance df.resid \n 3389.9 3538.3 -1669.9 3339.9 2778 \n\nScaled residuals: \n Min 1Q Median 3Q Max \n-3.2216 -0.7573 -0.3601 0.8794 2.7833 \n\nRandom effects:\n Groups Name Variance Std.Dev. Corr \n Neighb. (Intercept) 0.2021 0.4495 \n Working Cl. 0.2021 0.4495 -1.00 \n Underpriv. 0.2021 0.4495 -1.00 1.00\nNumber of obs: 2803, groups: Neigh., 50\n\nFixed effects:\n</code></pre>\n\n<p>The differences in the "call" part are due to the fact that I translated some words.</p>\n\n<p>I think I understand the relation between the random intercept and the random slope for linear regressions, but it is more difficult for logistic ones. I guess that when the correlation is positive, I can conclude that the type of neighborhood (social context) has a positive impact on the protectiveness of friendship ties, and conversely. But how do I quantify that?</p>\n\n<p>Moreover, I find it odd to get correlations of 1 or -1 and nothing more intermediate.</p>\n\n<p>As for the ICC, I am puzzled because I have seen a post about lmer regression indicating that the intraclass correlation can be computed by dividing the variance of the random intercept by the variance of the random intercept, plus the variance of the random variables, plus the residual variance. </p>\n\n<p>But there are no residuals in the results of a glmer. I have read in a book that the ICC must be computed by dividing the random intercept variance by the random intercept variance plus 3.29 (pi²/3). But in another book, 3.29 was replaced by the inter-group variance (the first-level variance, I guess). \nWhat is the correct solution?</p>\n\n<p>I hope these questions are not too confusing.\nThank you for your attention!</p>\n |
| 90579 | 90878 | 55734 | 1 | 0 | 0 | 0 | 115352 | 0 | 16.0 | 0 | <p>For example, I was looking at <a href="http://en.wikipedia.org/wiki/10-second_barrier" rel="nofollow">this list of the 93 people</a> who have broken the "10-second barrier", after reading that sprinter Christophe Lemaitre was the first person of purely European descent to break the barrier, which got me wondering what the difference between the mean sprinting times for whites vs. blacks might be. Unfortunately, that number is probably not known, since it would require making thousands of average people sprint, and even then it wouldn't necessarily reflect the "true genetic" difference, since the people sprinting were not training for sprinting. So if you wanted to measure the genetic component of the difference, it might be more accurate to measure only the fastest people in the world, who are equally motivated and have been training for years, and therefore have eliminated the non-genetic disadvantages; at least that would be my theory.</p>\n\n<p>So if you could get a list of, say, the top 1000 fastest times in the 100 meter dash, and say 20 people on that list are white, could you use that data to give some estimate of what the full distributions look like and/or find what the means of those distributions are? How?</p>\n\n<hr>\n\n<p>The QUESTION is above^^ this is just some rambling:</p>\n\n<p>I would guess that if you were trying to find the difference in means between blacks' and whites' 100 meter sprinting times AFTER everyone in each population had trained for years, lost excess weight, etc., i.e. you only want to measure the genetic difference in maximum potential, then measuring average people will not be the way to go, since none of them will have trained to reach their maximum potential, and thus there will be many non-genetic factors causing differences. The other problem is that the difference between trained and untrained may not be the same, so if you want to measure the difference in max potential, it would be better to look at the 1000 fastest rather than 1000 average people. Also, the data for the 1000 fastest is very high quality, since it was gathered with laser timing and under official supervision, whereas data gathered from some fitness survey done at a few high schools would probably be of low quality.</p>\n |
| 90580 | 90879 | 55738 | 11 | 0 | 0 | 0 | 115360 | 2 | 40.0 | 4 | <p>Is Student's t test a Wald test?</p>\n\n<p>I've read the description of Wald tests from Wasserman's <em>All of Statistics</em>.</p>\n\n<p>It seems to me that the Wald test includes t-tests. Is that correct? If not, what makes a t-test not a Wald test?</p>\n |
| 90581 | 90880 | 55742 | 6 | 0 | 0 | 0 | 115366 | 1 | 17.0 | 0 | <p>Does any standard statistical software like R, SAS, or SPSS have procedures or code to analyze log-linear models for missing data in contingency tables using maximum likelihood estimation (or the EM algorithm or other iterative procedures), not multiple imputation techniques?</p>\n |
| 90582 | 90881 | 55744 | 6 | 1 | 0 | 0 | 115370 | 1 | 13.0 | 2 | <p>I'm analyzing an article for my studies whose hypothesis is that a change in work motivation is related to a change in mental well-being (<a href="http://www.sciencedirect.com/science...01879113001541" rel="nofollow">http://www.sciencedirect.com/science...01879113001541</a>). Sadly, I don't know much about Poisson regression. The follow-up measurement was 18 months later. Do you always take time into account when you do a Poisson regression? I'm not quite sure whether they did so in this study. If I imagine a graph of this regression, what would I see on the x axis and what on the y axis? Thanks for your help</p>\n |
| 90583 | 90882 | 55746 | 106 | 1 | 0 | 0 | 115376 | 1 | 5.0 | 2 | <p>My goal is to create a formula that can give an indication of how a YouTube channel's video will perform in the first 30 days of its lifespan and eliminate viral video / "lightning in a bottle" outliers that may be on the channel. The goal is to use the resulting number to price a video from a specific YouTube channel. </p>\n\n<p>As an example, a hypothetical YouTube channel uploads approx. 10 videos a month.\nVariables: </p>\n\n<ol>\n<li>Some videos get shared more and are more "viral"</li>\n<li>Videos have a "fat head" and "long tail." Fat head refers to the largest chunk of viewership, which in the case of established YouTube channels happens upfront, and the long tail refers to views accumulated over succeeding months.</li>\n</ol>\n\n<p>These view counts belong to videos that the same channel uploaded in the last 30 days (from most recent, in descending order): </p>\n\n<pre><code> 351,170 \n 770,783 \n1,183,166 \n 154,645 \n1,568,569 \n2,564,857 \n1,023,498 \n1,409,113 \n1,006,203 \n1,244,092 \n</code></pre>\n\n<p>So my questions: </p>\n\n<ol>\n<li>Is there a formula I could plug into my spreadsheet, given this data, that could come up with a conservative estimate of how a video will perform in its first 30 days? </li>\n<li>If not, how can I create one? </li>\n<li>Because some of these videos are still generating a "fat head" (like the most recently published video, with 351,170 views), would it make sense to instead gather and average videos uploaded in the last 30-60 days? (The fat head has time to impact the view count and settle.)</li>\n</ol>\n |
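
The betamix() answer shown in the sample rows above undoes the default logit/log links and then maps the mean/precision parameters (mu, phi) back to the usual Beta shape parameters via a = mu * phi and b = (1 - mu) * phi. The same arithmetic as a minimal Python sketch (independent of the R code; the numeric inputs are just the values quoted in that answer):

```python
import math

def inverse_links(eta_mu: float, eta_phi: float) -> tuple:
    """Undo the default betamix() links: logit for mu, log for phi."""
    mu = 1.0 / (1.0 + math.exp(-eta_mu))   # inverse logit (R's plogis)
    phi = math.exp(eta_phi)                # inverse log
    return mu, phi

def beta_mu_phi_to_shapes(mu: float, phi: float) -> tuple:
    """Convert the (mean, precision) parameterization of the Beta
    distribution to the usual shape parameters (alpha, beta)."""
    return mu * phi, (1 - mu) * phi

# Means/precisions quoted in the answer above: (0.25, 49.4) and (0.77, 47.8)
for mu, phi in [(0.25, 49.4), (0.77, 47.8)]:
    a, b = beta_mu_phi_to_shapes(mu, phi)
    print(f"mu={mu}, phi={phi} -> alpha={a:.2f}, beta={b:.2f}")
```

Running this reproduces the shape pairs the answer reports (roughly 12.4, 37.0 and 36.8, 11.0), confirming the transform is just two multiplications once the links are inverted.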
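
The multilevel-logistic question in the rows above asks how to compute an ICC when glmer reports no residual variance. Under the common latent-variable convention, the level-1 residual variance of a logistic model is fixed at pi²/3 (about 3.29), so ICC = var_intercept / (var_intercept + pi²/3). A small Python sketch, using the intercept variance 0.2021 from the quoted output purely as an illustration:

```python
import math

def logistic_icc(intercept_variance: float) -> float:
    """ICC for a random-intercept logistic model under the
    latent-variable approach: the level-1 residual variance is
    fixed at pi^2 / 3 (~3.29) for the standard logistic."""
    residual = math.pi ** 2 / 3
    return intercept_variance / (intercept_variance + residual)

# Intercept variance 0.2021 taken from the glmer output quoted above
print(f"ICC = {logistic_icc(0.2021):.3f}")
```

With that variance, only about 6% of the latent-scale variation lies between neighborhoods; other conventions for binary-outcome ICCs exist, so this is one accepted answer rather than the only one.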
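
The last sample row asks for a conservative 30-day view estimate that discounts viral outliers. One common robust-statistics sketch (not taken from the post itself) is to summarize the recent view counts with a median or a trimmed mean, both of which ignore the extreme tails:

```python
import statistics

# The ten 30-day view counts listed in the question above
views = [351_170, 770_783, 1_183_166, 154_645, 1_568_569,
         2_564_857, 1_023_498, 1_409_113, 1_006_203, 1_244_092]

def trimmed_mean(xs, trim=0.1):
    """Mean after dropping the lowest and highest `trim` fraction,
    so a single viral video cannot dominate the estimate."""
    xs = sorted(xs)
    k = int(len(xs) * trim)
    kept = xs[k:len(xs) - k] if k else xs
    return sum(kept) / len(kept)

median_views = statistics.median(views)   # fully outlier-resistant
print(median_views, round(trimmed_mean(views)))
```

Both statistics land near 1.1M for this channel, noticeably below the raw mean, which the 2.56M viral video pulls upward; either would serve as the "conservative number" the asker wants in a spreadsheet.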